PREPARED BY POWELL A. NDLOVU¶

Bosherstone Lakes Data Analysis Jupyter Note book¶

ANALYSING ATMOSPHERIC TEMPERATURE VS WATER TEMPERATURE AT STACKPOLE¶

We will data from https://www.metoffice.gov.uk/research/climate/maps-and-data/historic-station-data¶

Historic station data from Met Office “Historic station data” (Aberporth) the nearest to stackpole¶

Aberporth Location: 224100E 252100N, Lat 52.139 Lon -4.570, 133 metres amsl Estimated data is marked with a * after the value. Missing data (more than 2 days missing in month) is marked by ---. Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder.

Importing our Libraries¶

In [138]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

Loading data set from the Met Website¶

In [3]:
df = pd.read_csv('Met_data.csv')
In [5]:
Met = df

A snipet of the first 5 rows of the data¶

In [7]:
df.head()
Out[7]:
Year Month Month_Name max_temp min_temp rain sun
0 1941 1 January NaN NaN 74.7 NaN
1 1941 2 February NaN NaN 69.1 NaN
2 1941 2 March NaN NaN 76.2 NaN
3 1941 4 April NaN NaN 33.7 NaN
4 1941 5 May NaN NaN 51.3 NaN

As you can see there are missing values we need to clean our data¶

LETS CHECK THE NUMBER OF MISSING VALUES¶

In [9]:
df.isnull().sum()
Out[9]:
Year           0
Month          0
Month_Name     0
max_temp      18
min_temp      12
rain           0
sun           14
dtype: int64

Percentage of missing values from each column¶

In [20]:
df.isnull().sum()/100
Out[20]:
Year          0.00
Month         0.00
Month_Name    0.00
max_temp      0.18
min_temp      0.12
rain          0.00
sun           0.14
dtype: float64
In [22]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017 entries, 0 to 1016
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Year        1017 non-null   int64  
 1   Month       1017 non-null   int64  
 2   Month_Name  1017 non-null   object 
 3   max_temp    999 non-null    float64
 4   min_temp    1005 non-null   float64
 5   rain        1017 non-null   float64
 6   sun         1003 non-null   object 
dtypes: float64(3), int64(2), object(2)
memory usage: 55.7+ KB
In [26]:
# Let's see if we have any missing data, luckily we don't!
sns.heatmap(df.isnull(), yticklabels = False, cbar = False, cmap="Blues")
Out[26]:
<Axes: >
No description has been provided for this image

Blue patches indicate presence of missing values at the top of each columns¶

We will be dealing with time series analysis hence we need to convert our date columns to date data format understood by python¶

In [32]:
# Build a monthly period index, and an actual date
df["date"] = pd.to_datetime(dict(year=df["Year"], month=df["Month"], day=15))
#df = df.set_index("date").sort_index()
In [34]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017 entries, 0 to 1016
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Year        1017 non-null   int64         
 1   Month       1017 non-null   int64         
 2   Month_Name  1017 non-null   object        
 3   max_temp    999 non-null    float64       
 4   min_temp    1005 non-null   float64       
 5   rain        1017 non-null   float64       
 6   sun         1003 non-null   object        
 7   date        1017 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(2), object(2)
memory usage: 63.7+ KB

Look under Dtype we now have a date formate under date although maintaining the existing column¶

In [37]:
df.head()
Out[37]:
Year Month Month_Name max_temp min_temp rain sun date
0 1941 1 January NaN NaN 74.7 NaN 1941-01-15
1 1941 2 February NaN NaN 69.1 NaN 1941-02-15
2 1941 2 March NaN NaN 76.2 NaN 1941-02-15
3 1941 4 April NaN NaN 33.7 NaN 1941-04-15
4 1941 5 May NaN NaN 51.3 NaN 1941-05-15

Under date dont mind 15 its just a place holder so that python can read as date because in our data set we dont have daily records.¶

In [ ]:
 

Lets Remove Na Values because they will affect our models¶

In [42]:
# Drop rows with NaN values
df = df.dropna()
In [44]:
df.head()
Out[44]:
Year Month Month_Name max_temp min_temp rain sun date
12 1942 1 January 5.8 2.1 114.0 58 1942-01-15
13 1942 2 February 4.2 -0.6 13.8 80.3 1942-02-15
14 1942 2 March 9.7 3.7 58.0 117.9 1942-02-15
15 1942 4 April 13.1 5.3 42.5 200.1 1942-04-15
16 1942 5 May 14.0 6.9 101.1 215.1 1942-05-15

NAN Values gone¶

In [ ]:
 

Lets plot our temperature (Max and Min Temp)data from Met Offuce¶

In [48]:
import plotly.express as px

# For NASA POWER daily df
fig = px.line(df.reset_index(), x="date", y=["max_temp","min_temp"],
              title="Stackpole (51.62, -4.93) – Daily Temperature (NASA POWER)")
fig.show()
C:\Users\User\anaconda3\Lib\site-packages\kaleido\_sync_server.py:11: UserWarning:



Warning: You have Plotly version 5.24.1, which is not compatible with this version of Kaleido (1.1.0).

This means that static image generation (e.g. `fig.write_image()`) will not work.

Please upgrade Plotly to version 6.1.1 or greater, or downgrade Kaleido to version 0.2.1.


Now lets load in our water temperature data that was separated from our main survey¶

In [51]:
temp_water = pd.read_csv('df_temp.csv')
In [ ]:
 
In [53]:
temp_water.head()
Out[53]:
Sampling_Date Conductivity_At_20c Temperature pH
0 7/26/1977 360.0 16.0 7.9
1 7/26/1977 300.0 16.5 8.9
2 7/26/1977 380.0 18.0 8.4
3 7/26/1977 600.0 12.0 7.7
4 7/26/1977 390.0 18.5 8.2

Lets convert our sampling date to date time that is recognisable by python¶

In [60]:
temp_water['Date'] = pd.to_datetime(temp_water['Sampling_Date'])
In [62]:
temp_water.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3316 entries, 0 to 3315
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Sampling_Date        3316 non-null   object        
 1   Conductivity_At_20c  2061 non-null   float64       
 2   Temperature          2550 non-null   float64       
 3   pH                   2943 non-null   float64       
 4   Date                 3316 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 129.7+ KB

Now we want to create a column WERE we separate the Month from the year¶

In [69]:
temp_water['Month'] = temp_water['Date'].dt.month
temp_water['Year'] = temp_water['Date'].dt.year
In [71]:
temp_water.head()
Out[71]:
Sampling_Date Conductivity_At_20c Temperature pH Date Month Year
0 7/26/1977 360.0 16.0 7.9 1977-07-26 7 1977
1 7/26/1977 300.0 16.5 8.9 1977-07-26 7 1977
2 7/26/1977 380.0 18.0 8.4 1977-07-26 7 1977
3 7/26/1977 600.0 12.0 7.7 1977-07-26 7 1977
4 7/26/1977 390.0 18.5 8.2 1977-07-26 7 1977
In [73]:
temp_water.head()
Out[73]:
Sampling_Date Conductivity_At_20c Temperature pH Date Month Year
0 7/26/1977 360.0 16.0 7.9 1977-07-26 7 1977
1 7/26/1977 300.0 16.5 8.9 1977-07-26 7 1977
2 7/26/1977 380.0 18.0 8.4 1977-07-26 7 1977
3 7/26/1977 600.0 12.0 7.7 1977-07-26 7 1977
4 7/26/1977 390.0 18.5 8.2 1977-07-26 7 1977

IMPACT OF ATMOSPHERIC TEMPERATURE ON WATER TEMPERATURE AT STACKPOLE LAKE¶

In [75]:
import matplotlib.pyplot as plt
import seaborn as sns

Now we want to calculate the average for atmospheric Temperature and Water Temperature¶

We will group by by YEAR AND MONTH To get Max Atmosperic and Water Temperature by Month and Year¶
In [79]:
#calculate monthly average for Atmospheric Temp and Water Temp 
df_atm = df.groupby(['Year', 'Month'])['max_temp'].mean().reset_index()

#calculate monthly average Water Temperature
df_water = temp_water.groupby(['Year', 'Month'])['Temperature'].mean().reset_index()

Lets Plot them in a line plot¶

Enjoy hovering around figures¶

In [84]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))

# Plot for Atmospheric Temperature
sns.lineplot(
    x='Month',
    y='max_temp',
    data=df_atm,
    marker='o',
    color='red',
    label='Atmospheric Temperature'
)

# Plot for Water Temperature
sns.lineplot(
    x='Month',
    y='Temperature',
    data=df_water,
    marker='o',
    color='blue',
    label='Water Temperature'
)

plt.title('Monthly Average Water Temperature vs Atmospheric Temperature', fontsize=14)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Monthly Water Temperature and Atmospheric Temperature (°C)', fontsize=12)
plt.xticks(
    ticks=range(1, 13),
    labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)

# ✅ Save high-quality image before showing
plt.savefig("Water_vs_Atmospheric_Temperature_LinePlot.png", dpi=400, bbox_inches='tight')

plt.show()
No description has been provided for this image

Lets see the column names of each Data set¶

In [86]:
df_water.columns
Out[86]:
Index(['Year', 'Month', 'Temperature'], dtype='object')
In [88]:
df_atm.columns
Out[88]:
Index(['Year', 'Month', 'max_temp'], dtype='object')

Now we want to create a QUADRATIC REGRESSION ANALYSIS BETWEEN ATMOSPERIC TEMPERATURE AND WATER TEMPERATURE¶

We want to understand how Atmosperic Temperature Infuences Water Temperature¶

In [93]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# ---------------------------------------------------------
# 1️⃣ Merge the two datasets by Month
# ---------------------------------------------------------
# df = dataset containing 'Month' and 'max_temp' (Atmospheric Temperature)
# df_2 = dataset containing 'Month' and 'Temperature' (Water Temperature)

merged = pd.merge(df_atm[['Month', 'max_temp']], df_water[['Month', 'Temperature']], on='Month')

# Clean dataset: remove NaN, inf, and duplicates
merged = merged.replace([np.inf, -np.inf], np.nan).dropna(subset=['max_temp', 'Temperature'])
merged = merged.drop_duplicates(subset=['max_temp', 'Temperature'])

# ---------------------------------------------------------
# 2️⃣ Fit Quadratic Regression Model (y = a*x² + b*x + c)
# ---------------------------------------------------------
x = merged['max_temp']
y = merged['Temperature']

# Fit quadratic model
coeffs = np.polyfit(x, y, deg=2)
a, b, c = coeffs

# Predicted values and R²
merged['predicted'] = a * x**2 + b * x + c
r2 = r2_score(y, merged['predicted'])

# ---------------------------------------------------------
# 3️⃣ Visualization
# ---------------------------------------------------------
plt.figure(figsize=(8,6))

# Scatter plot (observed data)
sns.scatterplot(
    x='max_temp', y='Temperature',
    data=merged,
    color='royalblue',
    s=70,
    label='Observed Data'
)

# Quadratic regression curve
x_fit = np.linspace(x.min(), x.max(), 200)
y_fit = a * x_fit**2 + b * x_fit + c
plt.plot(x_fit, y_fit, color='red', linewidth=2.5, label='Quadratic Fit')

# ---------------------------------------------------------
# 4️⃣ Formatting and legend positioning
# ---------------------------------------------------------
plt.title("Quadratic Relationship Between Atmospheric and Water Temperature", fontsize=14)
plt.xlabel("Atmospheric Temperature (°C)", fontsize=12)
plt.ylabel("Water Temperature (°C)", fontsize=12)

# Annotate equation and R² on plot
plt.text(
    0.05, 0.93,
    f"y = {a:.4f}x² + {b:.3f}x + {c:.3f}\nR² = {r2:.3f}",
    transform=plt.gca().transAxes,
    fontsize=11,
    verticalalignment='top',
    bbox=dict(boxstyle='round,pad=0.4', facecolor='white', alpha=0.7)
)

# Improved legend position and styling
plt.legend(
    loc='lower right',
    fontsize=10,
    frameon=True,
    facecolor='white',
    framealpha=0.9
)

plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()

# ✅ Save high-resolution plot
plt.savefig("Quadratic_Regression_Atmospheric_vs_Water_Temp_FixedLegend.png", dpi=400, bbox_inches='tight')
plt.show()

# ---------------------------------------------------------
# 5️⃣ Display regression results in console
# ---------------------------------------------------------
print("📊 Quadratic Regression Equation:")
print(f"Water Temp = {a:.4f} * (Air Temp)^2 + {b:.4f} * (Air Temp) + {c:.4f}")
print(f"R² = {r2:.3f}")
No description has been provided for this image
📊 Quadratic Regression Equation:
Water Temp = -0.0001 * (Air Temp)^2 + 0.9404 * (Air Temp) + 0.6502
R² = 0.714

71 % OF THE WATER TEMPERATURE VARIATION CAN BE EXPLAINED BY ATMOSPHERIC TEMPERATURE A CLIMATE CHANGE PROBLEM¶

In [96]:
df_water.columns
Out[96]:
Index(['Year', 'Month', 'Temperature'], dtype='object')
In [98]:
df_atm.columns
Out[98]:
Index(['Year', 'Month', 'max_temp'], dtype='object')
In [100]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))

# KDE for atmospheric temperature (df)
sns.kdeplot(
    data=df_atm,
    x='max_temp',
    fill=True,
    color='red',
    alpha=0.4,
    linewidth=2,
    label='Atmospheric Temperature'
)

# KDE for water temperature (df_2)
sns.kdeplot(
    data=df_water,
    x='Temperature',
    fill=True,
    color='blue',
    alpha=0.4,
    linewidth=2,
    label='Water Temperature'
)

plt.title('Kernel Density Estimate of Atmospheric vs. Water Temperature', fontsize=14)
plt.xlabel('Temperature (°C)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()

# Optional: save image for your report
plt.savefig("KDE_Temperature_Comparison.png", dpi=300, bbox_inches='tight')

plt.show()
No description has been provided for this image
In [102]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))

# Plot for year 1 (2024) - Atmospheric Temperature
sns.lineplot(
    x='Year',
    y='max_temp',
    data=df_atm,
    marker='o',
    color='red',
    label='Atmospheric Temperature'
)

# Plot for year 2 (2000) - Water Temperature
sns.lineplot(
    x='Year',
    y='Temperature',
    data=df_water,
    marker='o',
    color='blue',
    label='Water Temperature'
)

# Add labels and title
plt.title('Yearly Average Water Temperature vs Atmospheric Temperature', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Yearly Water Temperature and Atmospheric Temperature (°C)', fontsize=12)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)

# ✅ Save the figure (high resolution)
plt.savefig("Yearly_Water_vs_Atmospheric_Temperature.png", dpi=400, bbox_inches='tight')

# Display the plot
plt.show()
No description has been provided for this image

IMPACT OF ATMOSPHERIC TEMPERATURE ON pH and Conductivity At Stackpole¶

NOW WE WILL MERGE THE MET DATA SET AND THE WATER TEMPERATURE DATA SET TO ACHIEVE THE ABOVE¶

In [107]:
Met.head()
Out[107]:
Year Month Month_Name max_temp min_temp rain sun date
0 1941 1 January NaN NaN 74.7 NaN 1941-01-15
1 1941 2 February NaN NaN 69.1 NaN 1941-02-15
2 1941 2 March NaN NaN 76.2 NaN 1941-02-15
3 1941 4 April NaN NaN 33.7 NaN 1941-04-15
4 1941 5 May NaN NaN 51.3 NaN 1941-05-15
In [113]:
 Met = Met[['Year', 'Month', 'max_temp','rain']]
In [111]:
df_atm.head()
Out[111]:
Year Month max_temp
0 1942 1 5.80
1 1942 2 6.95
2 1942 4 13.10
3 1942 5 14.00
4 1942 6 16.20
In [115]:
Met.head()
Out[115]:
Year Month max_temp rain
0 1941 1 NaN 74.7
1 1941 2 NaN 69.1
2 1941 2 NaN 76.2
3 1941 4 NaN 33.7
4 1941 5 NaN 51.3
In [117]:
Met = Met.dropna()
In [119]:
Met.head()
Out[119]:
Year Month max_temp rain
12 1942 1 5.8 114.0
13 1942 2 4.2 13.8
14 1942 2 9.7 58.0
15 1942 4 13.1 42.5
16 1942 5 14.0 101.1
In [127]:
temp_water.head()
Out[127]:
Sampling_Date Conductivity_At_20c Temperature pH Date Month Year
0 7/26/1977 360.0 16.0 7.9 1977-07-26 7 1977
1 7/26/1977 300.0 16.5 8.9 1977-07-26 7 1977
2 7/26/1977 380.0 18.0 8.4 1977-07-26 7 1977
3 7/26/1977 600.0 12.0 7.7 1977-07-26 7 1977
4 7/26/1977 390.0 18.5 8.2 1977-07-26 7 1977

MERGING¶

In [129]:
merged = pd.merge(
    Met[['Year', 'Month', 'max_temp', 'rain']],
    temp_water[['Year', 'Month', 'Conductivity_At_20c', 'Temperature', 'pH']],
    on=['Year', 'Month'],
    how='inner'
)
In [131]:
merged.head()
Out[131]:
Year Month max_temp rain Conductivity_At_20c Temperature pH
0 1977 7 18.1 24.9 360.0 16.0 7.9
1 1977 7 18.1 24.9 300.0 16.5 8.9
2 1977 7 18.1 24.9 380.0 18.0 8.4
3 1977 7 18.1 24.9 600.0 12.0 7.7
4 1977 7 18.1 24.9 390.0 18.5 8.2

We will have to scale and standardise our data set bevause Conductivity, Temperature and rain have different scales¶

THE GRAPH BELOW IS SHOWING INTERACTION BETWEEN THE ATMOSPHERIC PARAMETERS FROM THE MET OFFICE AND WATER PARAMETERS FROM STACKPOLE¶

In [146]:
# Ensure Month is numeric and sorted
merged['Month'] = pd.to_numeric(merged['Month'], errors='coerce')
merged = merged.dropna(subset=['Month']).sort_values(['Year', 'Month'])

# ---------------------------------------------------------
# 3️⃣ Scale (Min–Max 0–1) for fair visual comparison
# ---------------------------------------------------------
scaler = MinMaxScaler()
cols_to_scale = ['max_temp', 'rain', 'Conductivity_At_20c', 'Temperature', 'pH']
merged[[col + '_scaled' for col in cols_to_scale]] = scaler.fit_transform(merged[cols_to_scale])

# ---------------------------------------------------------
# 4️⃣ Plot scaled variables on same line graph
# ---------------------------------------------------------
plt.figure(figsize=(14, 8))

sns.lineplot(x='Month', y='max_temp_scaled', data=merged, marker='o', label='Atmospheric Temperature (scaled)', color='blue')
sns.lineplot(x='Month', y='rain_scaled', data=merged, marker='o', label='Rainfall (scaled)', color='skyblue')
sns.lineplot(x='Month', y='Conductivity_At_20c_scaled', data=merged, marker='o', label='Conductivity (scaled)', color='orange')
sns.lineplot(x='Month', y='Temperature_scaled', data=merged, marker='o', label='Water Temperature (scaled)', color='red')
sns.lineplot(x='Month', y='pH_scaled', data=merged, marker='o', label='pH (scaled)', color='green')

plt.title('Scaled Monthly Comparison:Met Office Atmospheric Parameters vs Stack Pole Water Parameters', fontsize=15)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Scaled Values (0–1)', fontsize=12)
plt.xticks(
    ticks=range(1, 13),
    labels=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
)
plt.legend(title='Variables', fontsize=10)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()

# ✅ Save high-resolution figure
plt.savefig("Scaled_Atmospheric_vs_Water_Variables.png", dpi=400, bbox_inches='tight')

plt.show()

# ---------------------------------------------------------
# 5️⃣ Optional: inspect merged and scaled data
# ---------------------------------------------------------
print(merged.head())
No description has been provided for this image
   Year  Month  max_temp  rain  Conductivity_At_20c  Temperature   pH  \
0  1977      7      18.1  24.9                360.0         16.0  7.9   
1  1977      7      18.1  24.9                300.0         16.5  8.9   
2  1977      7      18.1  24.9                380.0         18.0  8.4   
3  1977      7      18.1  24.9                600.0         12.0  7.7   
4  1977      7      18.1  24.9                390.0         18.5  8.2   

   max_temp_scaled  rain_scaled  Conductivity_At_20c_scaled  \
0          0.78607     0.113699                    0.259995   
1          0.78607     0.113699                    0.215859   
2          0.78607     0.113699                    0.274707   
3          0.78607     0.113699                    0.436537   
4          0.78607     0.113699                    0.282063   

   Temperature_scaled  pH_scaled  
0            0.542373   0.737864  
1            0.559322   0.834951  
2            0.610169   0.786408  
3            0.406780   0.718447  
4            0.627119   0.766990  

THE END¶

In [ ]: